ABSTRACT 

The National Center for Biotechnology Information 
(NCBI) Reference Sequence (RefSeq) database is a 
collection of genomic, transcript and protein 
sequence records. These records are selected 
and curated from public sequence archives and 
represent a significant reduction in redundancy 
compared to the volume of data archived by the 
International Nucleotide Sequence Database 
Collaboration. The database includes over 16 000 
organisms, 2.4 × 10^6 genomic records, 13 × 10^6 
proteins and 2 × 10^6 RNA records spanning prokaryotes, 
eukaryotes and viruses (RefSeq release 
49, September 2011). The RefSeq database is main- 
tained by a combined approach of automated 
analyses, collaboration and manual curation to 
generate an up-to-date representation of the 
sequence, its features, names and cross-links to 
related sources of information. We report here on 
recent growth, the status of curating the human 
RefSeq data set, more extensive feature annotation 
and current policy for eukaryotic genome annota- 
tion via the NCBI annotation pipeline. More informa- 
tion about the resource is available online (see 
http://www.ncbi.nlm.nih.gov/RefSeq/). 



INTRODUCTION 

RefSeq integrates an organism's genomic, transcript and 
protein sequence with descriptive feature annotation and 
bibliographic information (1,2). The National Center for 
Biotechnology Information (NCBI) builds RefSeq from 
sequence data available in the public archival sequence 
databases of the International Nucleotide Sequence Database 
Collaboration (INSDC, including the DNA Data Bank of 
Japan, the European Nucleotide Archive and GenBank). 
Unique features of the RefSeq collection include its 
broad taxonomic scope, reduced redundancy, informative 
cross-links between nucleic acid and protein records (both 
curated and computationally derived) and daily curation 
and maintenance. Data linkages include names, protein 
domains, orthologs, Enzyme Commission (E.C.) 
numbers, phenotypes and disease. Curation and maintenance 
reflect new information and enable the RefSeq collection 
to support numerous research directions, including 
associating sequence with phenotype, providing a stable 
and consistent coordinate system to report clinical variation, 
comparative genomics and evolutionary studies. 
The RefSeq collection is a critical element of additional 
resources at NCBI, including dbSNP, dbVar, Gene, 
Genomes, Protein Clusters and Map Viewer, enabling 
the integration of these resources within and among 
organisms. 

The RefSeq database is a product of NCBI, a division 
of the National Library of Medicine at the US National 
Institutes of Health. Records are freely available by 
multiple methods, including Internet query, FTP downloads, 
BLAST or scripted query using NCBI's 
E-Utilities. A comprehensive FTP release is available on 
a bi-monthly schedule with incremental daily updates 
provided between releases. RefSeq records can be 
identified by a distinct accession format which includes 
an underscore ('_') at the third position. More information 
is available online (http://www.ncbi.nlm.nih.gov/books/NBK21091/). 



GROWTH OF THE REFSEQ DATA SET 

The comprehensive bi-monthly RefSeq release continues 
to grow as new genome and transcript sequences become 
publicly available. To support the needs of different 
research communities, the release is provided both com- 
prehensively in the 'complete' directory and based on 
general taxonomic groups, mitochondrial or plastid 
genomes or plasmid molecules. Release 49 (September 
2011) includes records from 16248 species representing 
13 137 813 protein records. Table 1 indicates an annual 
growth of 49.7 and 14.5% in the number of organisms and 
the number of accessions, respectively. Records included 
in the release incorporate over 200 million feature annotation 
links (denoted 'db_xref=') to 60 different 
Web-based resources. These links allow navigation to 
related information from these resources, including those 
within NCBI, e.g. Gene (3), the Conserved Domain 
database [CDD (4)] and dbSNP (5), and externally, including 
nomenclature groups, model organism databases, 
protein-focused resources and many more. Links are 
managed by collaboration and propagation from the 
INSDC records upon which RefSeq is based. 

Table 1. Annual growth of the RefSeq release. Release 43 included data available on 7 September 2010; release 49 included data available on 5 September 2011. 

Table 2. Distribution of RefSeq release 49 by FTP directory 
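
As an illustration of how the 'db_xref=' links described above could be tallied by target resource, the following is a minimal sketch. It assumes the qualifiers appear in the flatfile as /db_xref="Resource:identifier"; the function name and the example excerpt (including its placeholder identifiers) are hypothetical and are not taken from a RefSeq record.

```python
import re
from collections import Counter

# Matches feature qualifiers of the assumed form /db_xref="Resource:identifier".
DB_XREF = re.compile(r'/db_xref="([^:"]+):([^"]+)"')


def count_db_xrefs(flatfile_text):
    """Return a Counter mapping each linked resource name to its number of links."""
    return Counter(db for db, _ in DB_XREF.findall(flatfile_text))


# Hypothetical feature-table excerpt with placeholder identifiers.
example = '''
     gene            1..1200
                     /gene="EXAMPLE1"
                     /db_xref="GeneID:0000"
     CDS             100..900
                     /db_xref="GeneID:0000"
                     /db_xref="CDD:0000"
'''

print(count_db_xrefs(example))
```

Counting by the text before the colon gives one tally per linked resource, which is the granularity at which the release statistics above are reported.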

Microbial organisms as a group account both for the 
greatest number of organisms and accessions in Release 49 
(Table 2) and displayed the most significant annual 
growth in number of organisms (85.2%; Table 1). Note, 
however, that the number of microbial group accessions 
increased by only 18.6%. This value is skewed downward 
relative to the growth in the number of organisms or RNA 
records. Release 49 actually saw a 156% increase in the 
number of microbial RNA records (data not shown); this 
reflects activity of the RefSeq Targeted Locus project 
(http://www.ncbi.nlm.nih.gov/genomes/static/refseqtarget 
.html), whose mandate is to provide a single representative 
16S ribosomal RNA sequence for bacterial and archaeal 



genomes and strains. Release 49 included 6949 organisms 
with a single record, 5680 organisms with more than one 
but fewer than 100 accessions and 184 organisms with 
more than 10000 accessions. 



STATUS OF CURATING HUMAN REFSEQ RECORDS 

NCBI staff actively curate several subsets of the RefSeq 
collection for Homo sapiens. Curation improves multiple 
aspects of the human RefSeq collection by (i) providing 
quality reference sequence records for genomic regions, 
transcripts and proteins; (ii) maintaining and expanding 
functionally relevant information integrated into both 
RefSeq records and NCBI's Gene database; 
(iii) communicating and coordinating with international 
curation groups to generate a unified, consistent view of 
human genes and their primary products (see the 
Consensus CDS (CCDS) project, http://www.ncbi.nlm 
.nih.gov/CCDS/CcdsBrowse.cgi); and (iv) supporting the 
scientific community in response to suggestions, questions 
or error reports. 

Genomic regions 

RefSeq provides region-specific genomic region records 
for non-transcribed pseudogenes and for the 
RefSeqGene project (http://www.ncbi.nlm.nih.gov/ 
refseq/rsg/). Pseudogene loci are defined through collab- 
oration with the HUGO Gene Nomenclature Committee 
[HGNC (6)], downloaded from Pseudogene.org [http:// 
pseudogene.org/ (7)], or defined by RefSeq curation staff 
when reviewing transcripts having more than one 
high-quality alignment to the human genome. Curation 
involves defining the length and location of the pseudo- 
gene locus, determining whether it is transcribed and 
providing a link between the pseudogene locus and a func- 
tional homolog. As an example, please see NG_002746.2, 
which represents a eukaryotic translation initiation factor 
pseudogene and the 'General gene information' section of 
its Gene record (GeneID 1986) where a link to the related 
functional gene (EIF5A, GeneID 1984) is provided. 
Additionally, putative exon regions are annotated on the 
pseudogene RefSeq record based on alignment to a 
RefSeq transcript of the functional homolog. The 
number of non-transcribed pseudogene records increased 
by 7.7% in the past year. 

RefSeqGene, as part of the international Locus 
Reference Genomic initiative [LRG (8)], provides stable, 
gene-specific human genomic sequence records for 
reporting sequence variation in medical records and 
locus-specific databases (see http://www.ncbi.nlm.nih 
.gov/refseq/rsg/). The RefSeqGene and LRG records 
often represent explicitly only a subset of the known 
mRNA and coding regions. Identification of the sequences 
to use as standards depends on evaluation by the user 
base, but usually corresponds to the RefSeq transcript 
and protein records that have already been curated and 
reviewed by RefSeq and CCDS staff. If a question arises, 
review of evidence from the stakeholder, the literature and 
sequence evidence may result in an update to a 
RefSeqGene record, revision of the reference transcripts 
and proteins annotated on the RefSeqGene record or add- 
itional splice variants to represent in the RefSeq collection 
before assigning the LRG identifier. Transcript variants 
and protein isoforms not part of the explicit annotation 
are represented by alignments which can be seen in 
NCBI's graphical display. The number of RefSeqGene 
records grew by 25.8% in the past year. To request a 
RefSeqGene for a gene, contact rsgene@ncbi.nlm.nih.gov. 

Transcripts and proteins 

Transcripts and proteins are an important focus for 
curation at NCBI. This data set has two major 
categories: the 'model' subset generated directly by 
NCBI's genome annotation pipeline and the 'known' 
subset maintained independently of the genome annota- 
tion process using a combination of automated analyses 
and manual review. These subsets can be distinguished by 
the accession number prefix (models begin with 'X') as 
well as by the annotation in the COMMENT block of 
the record when viewed in flatfile format (see http:// 
www.ncbi.nlm.nih.gov/books/NBK21091/ for additional 
details). Records in the model subset are created or 
updated only upon whole-genome re-annotation but 
may be removed from the collection following manual 
review between such updates. The human model set was 
reviewed last year, which resulted in revision of the gene 
type designation (e.g. protein-coding, non-coding, pseudo- 
gene, etc.), replacement of model records with known 
records and removal of records considered to be insuffi- 
ciently supported. For example, 2068 model RefSeqs met 
the evidence criteria to be replaced by a known RefSeq 
type in the 1-year period between releases 43 and 49. 
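
The two accession conventions stated in the text (an underscore at the third position for RefSeq records, and a prefix beginning with 'X' for model records) can be checked mechanically. The sketch below is illustrative only: the function name is hypothetical, and it tests only these two stated rules rather than the full set of RefSeq prefixes.

```python
def classify_refseq_accession(accession):
    """Return 'model', 'non-model', or None if the string is not RefSeq-formatted."""
    # RefSeq accessions carry an underscore at the third position, e.g. NM_145204.3.
    if len(accession) < 4 or accession[2] != "_":
        return None
    # Model records are distinguished by a prefix that begins with 'X'.
    return "model" if accession[0] == "X" else "non-model"


# NM_145204.3 and NG_002746.2 are cited in this article; the other two
# strings are made-up placeholders used only to exercise each branch.
for acc in ("NM_145204.3", "NG_002746.2", "XM_000000.1", "AB000000.1"):
    print(acc, classify_refseq_accession(acc))
```

The status annotated in the COMMENT block of the record remains the authoritative indicator; the prefix check is only a quick filter.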

A series of status codes is annotated on the known 
RefSeq data set to indicate information about the level 
of curation (these codes are not applicable for the model 
record subset). Records with a status of either 'validated' 
or 'reviewed' are considered to be curated. As of RefSeq 
release 49, 92.5% of the human protein coding transcripts 
(and their associated proteins) are tracked with a curated 
status, and 57.2% of the non-coding transcripts are 
tracked with a curated status (Table 3). This includes 
curation to add or update over 7500 human transcript 
records between releases 43 and 49. RefSeq continues to 
represent protein-coding regions that are considered to be 
full length, and transcripts that are considered to be at 
least near complete. Transcripts that are obviously 
partial are not represented but are presented in NCBI's 
genome browser (Map Viewer). 

Table 3. Current status of human transcripts and proteins 

Type                                Accessions in Release 49 
                                    Total     Curated(a)   Percent curated 
Known protein-coding transcripts    31 933    29 531       92.5 
Model protein-coding transcripts    1118      NA 
Known non-coding transcripts        5932      3396         57.2 
Model non-coding transcripts        3762      NA 
Total                               42 745    32 927       77.0 

(a) Curated records have a review status of 'Validated' or 'Reviewed', 
which is not applied to model RefSeq records. 
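
The percentages reported in Table 3 follow directly from the accession counts. As a quick arithmetic check (a sketch only; the variable and function names are illustrative):

```python
# (total, curated) accession counts for the 'known' subsets, from Table 3.
known = {
    "protein-coding": (31933, 29531),
    "non-coding": (5932, 3396),
}
# Model subsets carry no curated status, so they contribute to totals only.
model_totals = {"protein-coding": 1118, "non-coding": 3762}


def percent_curated(total, curated):
    """Percentage of accessions with a curated status, to one decimal place."""
    return round(100.0 * curated / total, 1)


grand_total = sum(t for t, _ in known.values()) + sum(model_totals.values())
grand_curated = sum(c for _, c in known.values())

for label, (total, curated) in known.items():
    print(label, percent_curated(total, curated))
print("all transcripts", grand_total, grand_curated,
      percent_curated(grand_total, grand_curated))
```

Because model records are counted in the overall denominator but cannot be curated, the overall figure is lower than that of the known protein-coding subset.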

NCBI staff coordinates closely with other major data- 
bases and curation groups to maximize consistent data 
representation at NCBI and other web sites. The 
Consensus Coding Sequence collaboration [http://www 
.ncbi.nlm.nih.gov/projects/CCDS/CcdsBrowse.cgi (9)] is 
a central hub for curation of protein-coding loci and all 
members must agree to updates affecting the genomic 
coordinates of a CDS. Ambiguous or complex cases are 
discussed among CCDS members in light of available sup- 
porting evidence and published reports to achieve consen- 
sus on the likely annotated protein product. CCDS review 
often includes communication and coordination with the 
HGNC, UniProt or the Genome Reference Consortium 
[http://www.ncbi.nlm.nih.gov/projects/genome/assembly/ 
grc/ (10)] when annotation cannot be well represented due 
to a concern over the sequence represented in the reference 
human genome assembly. The human CCDS data set was 
updated twice in the past year, adding 2126 CCDS IDs for 
456 genes. The database currently includes 26473 distinct 
human protein identifiers corresponding to 18471 genes 
and is available at http://www.ncbi.nlm.nih.gov/CCDS/. 

In addition to ongoing review of transcripts and 
proteins, several changes affecting the content of RefSeq 
transcript and protein records were implemented recently. 
These include: 

• new policy for management of protein names; 

• new policy for management of readthrough (con- 
joined) transcripts; 

• expanded representation of non-coding RNAs to 
include microRNAs; and 

• expanded feature annotation for transcript and protein 
records. 

Protein names. RefSeq is in the process of adopting 
UniProtKB (11) guidelines for protein naming (http:// 
www.uniprot.org/docs/gennameprot) for both prokary- 
otic and eukaryotic records. Implementation of this 
policy is relatively new and remains variable across taxa. 
For prokaryotic proteins, protein name curation occurs in 
conjunction with NCBI's Protein Clusters resource 
(http://www.ncbi.nlm.nih.gov/proteinclusters). For verte- 
brate proteins associated in NCBI's Gene database with 
a related Swiss-Prot accession number, the Swiss-Prot 
preferred name is used verbatim. For some vertebrate 
RefSeq records, a distinct protein name continues to be 
provided if a Swiss-Prot name is not available, or one does 
not adhere to the revised UniProt guidelines. 

Readthrough transcripts. Transcripts representing exons 
from what are typically considered to be neighboring, 
yet distinct loci pose a particular challenge for curation. 
This category of data resulted in conflicting annotation, 
requiring extensive discussion among the CCDS collabor- 
ation members, as well as HGNC. NCBI and the CCDS 
collaboration recently defined an annotation policy that 
tracks most readthrough transcripts as distinct loci, because 
they are not solely the product of either of the two underlying 
loci. This approach improves consistency of the gene 
extent annotated for the two smaller loci while also reflect- 
ing the transcriptional complexity of the region. RefSeq 
opts to annotate the readthrough transcript when it 
appears to be full length and there is a minimum of two 
independent lines of support for the readthrough event. 
The Gene database reports this transcriptional complexity 
in the 'General gene information' section of the record. 
Examples can be found using available Gene queries [e.g. 
'readthrough parent' (properties)]. In the last year, RefSeq 
curators reviewed the data reported in the ConjoinG 
database (12) to expand representation of this type. 
RefSeq currently tracks 120 human loci as instantiated 
readthrough loci, each with a distinct GeneID (for 
example, NME1-NME2, GeneID 654364), and 358 loci 
with any type of readthrough association, which includes 
reports of readthrough transcripts that do not meet the 
requirements for representation in RefSeq (for example, 
GeneID 6728). 

Non-protein-coding transcripts. RefSeq representation of 
non-coding RNAs grew by 30% between September 
2010 and 2011. Non-coding transcripts are managed in 
part by downloading other publicly available data sets 
including those available from miRBase 
[http://www.mirbase.org/ (13)]. MicroRNAs, represented in RefSeq 
as the stem-loop precursor product with feature annota- 
tion of the functional RNA product, currently number 
6848 records, 1409 of which are for human. Other types 
of functional RNAs, for instance small nucleolar RNAs, 
may be initially defined by HGNC, or by NCBI curation 
staff. Long non-coding transcripts, including splice 
variants, have been added to the database as well. Some 
of these include transcripts considered unlikely to encode a 
protein for several reasons, including nonsense-mediated 
decay issues, inhibitory alternative open reading frames or 
an alternate splice variant for which there are concerns 
about significant protein truncation or numerous 
upstream ORFs. There are currently 6057 non-coding 
transcripts for 4421 human genes, including 1134 
non-coding transcripts for 797 human protein-coding loci. 



Expanded feature annotation. RefSeq feature annotation 
has expanded to indicate localization or function, and to 
highlight details of the sequence considered during manual 
review. For many years, RefSeq protein records have dis- 
played protein annotation computed by NCBI's CDD 
group, including protein domains, intra- or inter- 
molecular binding sites and metal-binding sites. While 
some signal peptide, mature peptide and other features 
have been manually annotated by NCBI staff, these 
features are now also propagated from UniProtKB/ 
Swiss-Prot records and predicted by SignalP 4.0 (14). 
Criteria for propagation include a high-quality alignment 
and confirmation that the sequence and feature length are 
consistently maintained. Feature types that are already 
provided by the CDD group are not propagated. The 
source of the annotated feature is indicated with a '/infer- 
ence' qualifier which cites SignalP4.0 or with a note 
'propagated from UniProtKB/Swiss-Prot' and an indica- 
tion of the Swiss-Prot accession number (for example, see 
NP_001028219.1, NP_001171622.1). 

Protein-coding RefSeq transcripts now display evidence 
for the 5' completeness of the annotated coding region 
following a computational search for an in-frame stop 
codon upstream of the annotated start codon. Identified 
stop codons are annotated with a misc_feat (see Figure 1 
and NM_145204.3). Non-protein-coding transcripts, 
when provided for a protein-coding gene, are also compu- 
tationally analysed to identify an open reading frame that 
shares the same start codon as a protein-coding transcript 
(for that gene) but that renders the transcript a candidate 
for nonsense-mediated mRNA decay (NMD). Putative 
NMD ORFs are annotated with a misc_feat (see 
NR_040252.1). Misc_feat annotation is also added to a 
non-coding transcript if it contains an upstream ORF 
likely to be inhibitory to translation of the predicted 
ORF (see CCDS documentation at http://www.ncbi.nlm 
.nih.gov/CCDS/docs/CCDS-AUGguidelines.pdf and 
NR_003253.1 for an example). 
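
The 5' completeness check described above amounts to stepping upstream from the annotated start codon, one codon at a time in the same reading frame, until a stop codon is found. The following is a minimal sketch of that idea, not NCBI's implementation; the function name, the 0-based coordinate convention and the toy sequences are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def upstream_inframe_stop(mrna, cds_start):
    """Return the 0-based position of the nearest stop codon that lies upstream
    of cds_start and in the same reading frame, or None if there is none."""
    seq = mrna.upper().replace("U", "T")
    pos = cds_start - 3
    # Walk backwards in steps of three so every codon tested is in frame
    # with the annotated start codon.
    while pos >= 0:
        if seq[pos:pos + 3] in STOP_CODONS:
            return pos
        pos -= 3
    return None


# Toy sequences (not real transcripts). In the first, TAG at position 2 is
# in frame with the ATG at position 8; in the second it is out of frame.
print(upstream_inframe_stop("GCTAGCCCATGAAATAA", 8))
print(upstream_inframe_stop("GCTAGCATGAAATAA", 6))
```

A returned position supports the annotated start codon as the true 5' end of the coding region; a result of None leaves open the possibility that the open reading frame extends further upstream than the transcript record.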



GENOME ANNOTATION POLICY 

Assembled genome sequence data are selected for inclu- 
sion in RefSeq based on several considerations including 
quality and completeness of a sequencing project, phylo- 
genetic distance, model organism status, impact on disease 
and health studies, and identified utility to targeted 
research projects. Over the last several years, NCBI has 
developed robust whole genome annotation pipelines for 
both prokaryotes and eukaryotes. The prokaryotic 
pipeline has matured to the point that it is routinely 
offered as a service to submitters if genome sequences 
are submitted to GenBank without annotation. RefSeq 
genome representation for prokaryotes is currently 
managed by propagating annotation from the primary 
genome data in GenBank, calculating annotation for 
RefSeq when annotation is not available in GenBank 
within 6 months of submission of the genome, supple- 
mented with curation to represent rRNAs, tRNAs and 
to provide improved protein names based on curated 
protein clusters. Eukaryotic genomes are managed based 
on general taxonomic groups, availability of submitted 
annotation and existence of an active model organism 
database. For mammalian genomes included in RefSeq, 
genome annotation is always provided using the NCBI 
annotation pipeline. For other organisms, RefSeq 
genome annotation is propagated from GenBank when 
available. Otherwise, annotation is provided using 
NCBI's eukaryotic annotation pipeline if a quality 
genome assembly is submitted with no intent to 
annotate, or if annotation is not submitted within a reasonable 
period of time, or is considered to need updating 
and the research group is not able to maintain it over time. 
When possible, the RefSeq group works with research 
communities and model organism databases to provide a 
single standard annotation for the reference genome; 
examples include Drosophila melanogaster, Arabidopsis 
thaliana, Anopheles gambiae, Saccharomyces cerevisiae 
and Escherichia coli K-12. 

Figure 1. NM_145204.3 is shown in the Nucleotide Graphical display format. The display was configured to show the six-frame translation track 
restricted to the sense strand, and to add three markers highlighting the annotated upstream in-frame stop codon, the translation initiation codon 
and a second in-frame AUG codon located further downstream. The observation of a stop codon upstream of, and in the same reading frame as, 
the annotated initiation codon suggests the annotated CDS is 5' complete. 



FUTURE DIRECTIONS 

One of the short-term goals for the RefSeq group is to be 
more transparent with regard to curation decisions and 
the supporting evidence used. The expanded transcript feature 
annotation mentioned above is a small step in this direction 
that will be further extended. Curators and programmers 
supporting the vertebrate RefSeq data set store a 
wide variety of gene and transcript data attributes that 
are potentially of use to consumers of the RefSeq data 
set. Attribute categories, and available stored data, are 
being reviewed and a subset will be selected for reporting 
in a structured comment on RefSeq records. Examples of 
stored attributes include reported RNA editing, potential 
alternate translation initiation codons, loci reported to be 
imprinted, use of non-AUG initiation codons and more. 
In addition, the vertebrate RefSeq group is working on 
reporting more explicit information about the underlying 
support for the exon combination that is instantiated in a 
RefSeq transcript record, to highlight proteins that are 
highly conserved, and to provide a comparison utility to 
evaluate putative functional consequence among transcript 
variants. 